ABSTRACT

Location-based services (LBS) have become more and more 
ubiquitous recently. Existing methods focus on finding relevant
points-of-interest (POIs) based on users' locations and
query keywords. Nowadays, modern LBS applications generate
a new kind of spatio-textual data, regions-of-interest
(ROIs), containing region-based spatial information and tex- 
tual description, e.g., mobile user profiles with active regions 
and interest tags. To satisfy search requirements on ROIs, 
we study a new research problem, called spatio-textual sim- 
ilarity search: Given a set of ROIs and a query ROI, we find 
the similar ROIs by considering spatial overlap and textual 
similarity. Spatio-textual similarity search has many impor- 
tant applications, e.g., social marketing in location-aware 
social networks. It calls for an efficient search method to
support large-scale spatio-textual data in LBS systems.
To this end, we introduce a filter-and-verification framework 
to compute the answers. In the filter step, we generate signa- 
tures for the ROIs and the query, and utilize the signatures 
to generate candidates whose signatures are similar to that 
of the query. In the verification step, we verify the candi- 
dates and identify the final answers. To achieve high per- 
formance, we generate effective high-quality signatures, and 
devise efficient filtering algorithms as well as pruning tech- 
niques. Experimental results on real and synthetic datasets 
show that our method achieves high performance. 

1. INTRODUCTION 

Nowadays, as mobile devices (e.g., smartphones) with built-in
Global Positioning System (GPS) receivers become more and more
popular, location-based services (LBS) have been widely accepted
by mobile users and have attracted significant attention
from both the academic and industrial communities. Many
location-based services, such as Foursquare¹ and Facebook
Places², bring unique location-aware experiences to users.

¹ http://foursquare.com

² http://www.facebook.com/places

Existing LBS systems employ a spatial keyword search
approach to provide LBS services [9, 7], which, given a set 
of points of interest (POIs), and a user query with location 
and keywords, finds all relevant POIs. For example, if a 
user wants to find the coffee shops nearby, she can issue 
a keyword query "coffee shop" to an LBS system, which 
returns the relevant coffee shops by considering the user's 
location and query keywords. 

Recently, many modern LBS applications generate a new 
kind of spatio-textual data, regions-of-interest (ROIs), containing
region-based spatial information and textual description.
For example, in Facebook Places, mobile users have
profiles consisting of active regions and interest tags. In 
wildlife monitoring, wild species with their habitats and de- 
scriptive features can be modeled by ROIs. 

To satisfy search requirements on ROIs, we introduce a 
new research problem, called spatio-textual similarity search 
in this paper: Given a set of ROIs and a query ROI, we aim 
to find the ROIs which are similar to the query by consid- 
ering spatial overlap and textual similarity. 

Spatio-textual similarity search can satisfy users' infor- 
mation needs in various real applications. The first one 
is location-based social marketing using Facebook Places. 
As mentioned above, in Facebook Places, mobile users have 
profiles that can be modeled by ROIs. For example, consider
some users in Manhattan who are interested in tea and
coffee. A coffee shop (e.g., Starbucks) can utilize user profiles
in Facebook to provide location-specific advertisements
to the potential customers who not only are interested in its
products (e.g., {starbucks, mocha, coffee}) but also have
region-based spatial overlap with its service area. Another 
example is friend recommendation in location-aware social 
networks, e.g., Facebook, Foursquare and Twitter³. Spatio-textual
similarity search helps mobile users find potential
friends with common interests (e.g., playing basketball)
and overlapping regions (e.g., Brooklyn), and thus helps
users form various kinds of circles with the same interests,
such as sports, shopping, and fan activities. Spatio-textual
similarity search can also support other applications,
e.g., wildlife protection. Wild species have their habitats
(e.g., Yellowstone National Park for grizzly bears) and features
(e.g., mammal, omnivore, etc.). A zoologist can issue a
query to find all wild species having certain features (e.g.,
mammal) and inhabiting a specific region (e.g., Idaho).

³ http://www.twitter.com

In this paper, we formalize the problem of spatio-textual
similarity search, and study the research challenges that naturally
arise in this problem. One challenge is how to evaluate
the similarity between two ROIs. Another challenge is how
to achieve high search efficiency, as LBS systems are required
to support millions of users and respond to queries in milliseconds.
Given a query ROI, there may be a huge number
of ROIs having significant overlaps with the query, so it
is rather expensive to find similar answers. Take the real
dataset Twitter in our experiments (see Section 6) as an
example. We used a set of query regions with an average area
of 0.4 square kilometers. Even for one of the small query
regions, there were, on average, 8000 ROIs overlapping with
it. Moreover, similarity search needs to consider both spatial 
and textual similarities. To address these challenges, we pro- 
pose an efficient Spatio-tExtuAl simiLarity search method, 
called SEAL-Search. We combine spatial similarity functions
and textual similarity functions to quantify the similarity
between two ROIs. To provide high performance, we in- 
troduce a filter-and-verification framework to compute the 
answers. In the filter step, our method generates signatures 
for spatio-textual objects and queries, and utilizes the signa- 
tures to generate candidates whose signatures are similar to 
those of the queries. In the verification step, it verifies the 
candidates and identifies the final answers. We develop ef- 
fective techniques to generate signatures and devise efficient 
filtering algorithms to prune dissimilar objects. 

To summarize, we make the following contributions. 

• To the best of our knowledge, we are the first to study 
spatio-textual similarity search on ROIs. We propose a 
filter-and-verification framework and signature-based 
filtering algorithms to address this problem. 

• For effective spatial pruning, we devise grid-based sig- 
natures and develop threshold-aware pruning techniques. 

• To utilize spatial and textual pruning simultaneously, 
we judiciously select high-quality signatures and devise 
efficient hybrid filtering algorithms. 

• We have conducted extensive experiments on real and 
synthetic datasets. Experimental results show that our 
algorithms achieve high performance. 

The paper is organized as follows. The problem formulation
and related work are presented in Section 2. We
introduce a signature-based method in Section 3. We develop
grid-based filtering algorithms in Section 4 and hybrid
filtering algorithms in Section 5. Experimental results are
provided in Section 6. Finally, we conclude the paper and
discuss future work in Section 7.

2. PRELIMINARIES 
2.1 Problem Formulation 

Data Model. Our work focuses on supporting similarity
search for a set of spatio-textual ROI objects (or objects for
simplicity), $O = \{o_1, o_2, \ldots, o_{|O|}\}$. Each object $o \in O$ consists
of spatial information $o.R$ and textual information $o.T$,
denoted by $o = (R, T)$. The spatial information $o.R$ is a
region. We use the well-known minimum bounding rectangle
(MBR) to represent region $o.R$ through the bottom-left
point and top-right point of the MBR. The textual description
$o.T$ is a set of tokens, i.e., $\{t_1, t_2, \ldots, t_{|o.T|}\}$, where each
token $t \in o.T$ is associated with a weight $w(t)$ to capture its
importance. Figure 1 illustrates an example of seven objects,
each of which has several tokens and a region.

Figure 1: An example of spatio-textual similarity
search with objects $\{o_1, o_2, \ldots, o_7\}$ and query $q$.

Query Model. Our paper considers a spatio-textual similarity
search query $q$ that also consists of a region $q.R$ and a
set of tokens $q.T = \{t_1, t_2, \ldots, t_{|q.T|}\}$. Given a set of objects
$O = \{o_1, o_2, \ldots, o_{|O|}\}$, the answers of query $q$ are a set of
objects $A \subseteq O$ similar to $q$, i.e., for each $o \in A$,

(1) The Spatial Similarity $\mathrm{sim}_R(q, o) \ge \tau_R$, and

(2) The Textual Similarity $\mathrm{sim}_T(q, o) \ge \tau_T$,

where $\tau_R$ and $\tau_T$ are respectively spatial and textual similarity
thresholds satisfying $0 < \tau_R, \tau_T \le 1$. Notice that we use
the two thresholds to allow users to determine the spatial
relevance and textual relevance in a more flexible way.

We quantify the spatial similarity based on the overlap
of regions. Given two regions $q.R$ and $o.R$, their overlap is
formally defined as the area of the intersecting region of $q.R$
and $o.R$, denoted by $|q.R \cap o.R|$. Note that we use operator
$|\cdot|$ to represent both the cardinality of a set and the area of
a region, if there is no ambiguity. Based on the overlap, we
consider the Jaccard similarity in this paper.

Definition 1 (Spatial Similarity). The spatial Jaccard
similarity for two regions $q.R$ and $o.R$ is defined as

$$\mathrm{sim}_R(q, o) = \frac{|q.R \cap o.R|}{|q.R \cup o.R|},$$

where $|q.R \cup o.R| = |q.R| + |o.R| - |q.R \cap o.R|$.

For example, the overlap of regions $o_1.R$ and $q.R$ in Figure 1
is $|q.R \cap o_1.R| = 1000$ and $|q.R \cup o_1.R| = 4400$. Thus
the spatial similarity is $\mathrm{sim}_R(q, o_1) = 1000/4400 = 0.23$. Note
that our method can be easily extended to other overlap-based
functions, such as Dice similarity.
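The spatial Jaccard similarity on MBRs can be sketched in a few lines of Python. This is only an illustration: the `MBR` type, the function names, and the two rectangles are assumptions, chosen so that the overlap and union areas match the 1000 and 4400 quoted in the example; they are not the actual coordinates of Figure 1.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Axis-aligned rectangle given by its bottom-left and top-right corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(r: MBR) -> float:
    return max(0.0, r.x2 - r.x1) * max(0.0, r.y2 - r.y1)


def overlap_area(a: MBR, b: MBR) -> float:
    """Area of the intersection of two MBRs (0 if they do not overlap)."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return w * h if w > 0 and h > 0 else 0.0


def spatial_jaccard(a: MBR, b: MBR) -> float:
    """|a ∩ b| / |a ∪ b|, with |a ∪ b| = |a| + |b| - |a ∩ b|."""
    inter = overlap_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical rectangles: areas 2400 and 3000, overlap 25 x 40 = 1000.
q_region = MBR(0, 0, 60, 40)
o_region = MBR(35, 0, 85, 60)
print(round(spatial_jaccard(q_region, o_region), 2))  # prints 0.23
```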

On the other hand, textual similarity measures the simi- 
larity between two token sets q.T and o.T. Many token-set 
based similarity functions have been studied in string simi- 
larity join/search [5, 18, 21], such as Jaccard similarity, Dice 
similarity, Cosine similarity, etc. In this paper, we take the 
weighted Jaccard similarity as an example. 

Definition 2 (Textual Similarity). The textual
similarity $\mathrm{sim}_T(q, o)$ between $q.T$ and $o.T$ is defined as the
weighted Jaccard coefficient of the two token sets, i.e.,

$$\mathrm{sim}_T(q, o) = \frac{\sum_{t \in q.T \cap o.T} w(t)}{\sum_{t \in q.T \cup o.T} w(t)},$$

where $w(t)$ is the weight of token $t$.

In this paper we use the inverse document frequency (denoted
by idf) as token weight, i.e., $w(t) = \ln \frac{|O|}{\mathrm{count}(t, O)}$, where
$\mathrm{count}(t, O)$ is the number of objects containing token $t$. Figure 1
shows five distinct tokens (i.e., $t_1 \sim t_5$) with their
weights. Using the weights, we compute textual similarity
$\mathrm{sim}_T(q, o_1) = (w(t_1) + w(t_2)) / (w(t_1) + w(t_2) + w(t_3)) = 0.58$.
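As a rough sketch of the two formulas above, the following Python computes idf weights over a collection of token sets and the weighted Jaccard coefficient. The function names are illustrative, and the three weights in the usage example are assumed values (the actual weights appear only in Figure 1), picked so that the result agrees with the 0.58 quoted in the text.

```python
import math
from typing import Dict, Iterable, Set


def idf_weights(token_sets: Iterable[Set[str]]) -> Dict[str, float]:
    """w(t) = ln(|O| / count(t, O)) for every token occurring in the collection."""
    sets = list(token_sets)
    counts: Dict[str, int] = {}
    for tokens in sets:
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
    return {t: math.log(len(sets) / n) for t, n in counts.items()}


def weighted_jaccard(a: Set[str], b: Set[str], w: Dict[str, float]) -> float:
    """Weight of the shared tokens divided by the weight of all tokens in either set."""
    shared = sum(w.get(t, 0.0) for t in a & b)
    total = sum(w.get(t, 0.0) for t in a | b)
    return shared / total if total > 0 else 0.0


# Hypothetical weights, not the ones shown in Figure 1.
w = {"t1": 0.5, "t2": 0.6, "t3": 0.8}
print(round(weighted_jaccard({"t1", "t2", "t3"}, {"t1", "t2"}, w), 2))  # prints 0.58
```

Note that a token contained in every object receives weight $\ln 1 = 0$ under this definition, so it never contributes to the similarity.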

Based on the above-mentioned notations, we formalize the 
spatio-textual similarity search problem. 

Definition 3 (Spatio-Textual Similarity Search).
Consider a set of spatio-textual objects $O = \{o_1, o_2, \ldots, o_{|O|}\}$
and a spatio-textual similarity search query $q = (R, T, \tau_R, \tau_T)$.
It returns the objects $A \subseteq O$ such that spatial similarity
$\mathrm{sim}_R(q, o) \ge \tau_R$ and textual similarity $\mathrm{sim}_T(q, o) \ge \tau_T$, i.e.,
$A = \{o \mid o \in O, \mathrm{sim}_R(q, o) \ge \tau_R, \mathrm{sim}_T(q, o) \ge \tau_T\}$.

Example 1. Consider the objects $\{o_1, o_2, \ldots, o_7\}$ and a
query $q = (R_q, \{t_1, t_2, t_3\}, 0.25, 0.3)$ in Figure 1. Object
$o_2 = (R_2, \{t_1, t_2, t_3\})$ is an answer since it satisfies
$\mathrm{sim}_R = 0.32 \ge \tau_R$ (0.25) and $\mathrm{sim}_T = 1 \ge \tau_T$ (0.3). In contrast, object
$o_1 = (R_1, \{t_1, t_2\})$ is not an answer due to $\mathrm{sim}_R = 0.23 < \tau_R$,
although satisfying $\mathrm{sim}_T = 0.58 \ge \tau_T$. Considering all the
objects in $O$, we obtain the answer of $q$, $A = \{o_2\}$.

2.2 Related Work 

Spatial Keyword Search: There are many studies on
spatial keyword search [25, 6, 11, 9, 23, 7, 24, 22, 3, 20,
4, 16, 14, 13]. One problem is kNN-based keyword search,
which, given a query consisting of a location and a set of keywords,
finds top-$k$ relevant POIs by considering distance and
textual relevance. Felipe et al. [9] integrated signature files
into the R-tree, and Cong et al. [7] combined inverted files and
the R-tree. Another problem is region-based keyword search,
which, given a query consisting of a region and a set of keywords,
finds the POIs in the region that are relevant to the keywords.
The methods addressing this problem also employed
the R-tree index, and integrated inverted lists of
keywords into R-tree nodes [25, 11, 6].

Our spatio-textual similarity search problem is substan- 
tially different from the above-mentioned problems. The un- 
derlying data is a set of spatio-textual objects consisting of 
regions and tokens (i.e., ROIs), rather than POIs. Moreover, 
the query model is different, and we focus on spatio-textual 
similarity between objects and queries and devise efficient 
filtering algorithms for similarity search. 

String Similarity Search/Join: The problem of string
similarity search/join has been extensively studied [10, 1, 2,
5, 18, 19, 12]. Given a set of strings and a query string,
string similarity search finds the strings whose similarities
to the query are not smaller than a threshold $\tau$. Existing
studies employed various functions, e.g., edit distance and
Jaccard similarity, to quantify the similarity. To improve the
performance, Chaudhuri et al. [5] proposed a prefix filtering
framework, and Bayardo et al. [2] employed this framework
to support Jaccard or Cosine similarity functions.

Figure 2: The IR-tree index of objects in Figure 1.

The basic idea of prefix filtering is to estimate a similarity
upper bound of two sets using their subsets. Consider a
string object $o$ and a query string $q$. The prefix filtering
framework first maps both strings to sets, denoted by $S(o)$
and $S(q)$, and transforms various similarity functions to the
overlap similarity on sets. More formally, if $\mathrm{sim}(q, o) \ge \tau$,
then the sets satisfy $|S(q) \cap S(o)| \ge c$, where $c$ is a threshold
deduced from $\tau$. Then, the framework fixes a global order
on the elements of all sets, and sorts the elements in each
set based on the global order. Let $S^p(o)$ denote the prefix
of set $S(o)$ consisting of the first $|S(o)| - c + 1$ elements. It
is easy to prove that if $|S(q) \cap S(o)| \ge c$, their prefix sets
must overlap, i.e., $S^p(q) \cap S^p(o) \ne \emptyset$. Therefore, we can
prune the dissimilar objects satisfying $S^p(q) \cap S^p(o) = \emptyset$.
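The prefix filtering principle for an unweighted overlap threshold can be illustrated with a small Python sketch. The helper names `prefix` and `may_be_similar`, the toy sets, and the alphabetical global order are assumptions made for the illustration only.

```python
from typing import Dict, Hashable, List, Set


def prefix(elements: Set[Hashable], order: Dict[Hashable, int], c: int) -> List[Hashable]:
    """First |S| - c + 1 elements of S under the global order (empty if |S| < c)."""
    ordered = sorted(elements, key=lambda e: order[e])
    return ordered[:max(len(ordered) - c + 1, 0)]


def may_be_similar(q: Set[Hashable], o: Set[Hashable],
                   order: Dict[Hashable, int], c: int) -> bool:
    """False means |q ∩ o| < c is guaranteed, so o can be pruned without verification."""
    return bool(set(prefix(q, order, c)) & set(prefix(o, order, c)))


order = {e: i for i, e in enumerate("abcdef")}   # assumed global order a < b < ... < f
q = {"a", "b", "c", "d"}
print(may_be_similar(q, {"c", "d", "e", "f"}, order, 2))  # True: prefixes share "c"
print(may_be_similar(q, {"d", "e", "f"}, order, 2))       # False: overlap is only 1
```

A `True` result only makes the object a candidate; the actual overlap still has to be verified.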

We utilize the prefix filtering framework to prune dissim- 
ilar objects. Compared with existing methods, our method 
focuses on devising effective threshold-aware pruning based 
on judiciously selected spatio-textual signatures. 

Grid-Based Spatial Index Structures: The grid-based 
spatial indexes, such as Grid File and EXCELL, have been 
studied in spatial databases [17]. These methods decom- 
posed the underlying space into a set of grids, and stored 
POIs into grids for fast access. We utilize grids in a dif- 
ferent way: We employ grids as signatures of spatio-textual 
objects, and develop efficient spatial pruning techniques. 

2.3 Baseline Methods

We introduce several straightforward methods to address 
the problem defined above, and will show their poor perfor- 
mance using experimental results (see Section 6). 

Keyword-first method. The method constructs inverted
indexes by mapping tokens to objects containing the tokens.
Given a query, it first finds the objects with $\mathrm{sim}_T \ge \tau_T$
as candidates. Then, it verifies whether $\mathrm{sim}_R \ge \tau_R$. The
drawback of the method is that it may generate too many
candidates, leading to low search performance.

Spatial-first method. The method first finds the objects
with $\mathrm{sim}_R \ge \tau_R$ as candidates, and then filters out the candidates
whose $\mathrm{sim}_T < \tau_T$. The method may also generate too many
candidates, and cannot find similar objects fast.

Spatial keyword search based method. We can extend
the IR-tree method [7] to support our spatio-textual
similarity search as follows. Specifically, we construct an IR-tree
to index all objects, where each node contains an MBR
and an inverted file which maps a token to the child nodes
containing the token. In particular, we store the spatio-textual
objects in leaf nodes. Given a query $q = (R, T)$, the
IR-tree based algorithm traverses the tree from the root to
its leaf nodes. The algorithm takes an intermediate node
$n$ as a candidate and visits its descendants, if 1) the spatial
overlap $|q.R \cap n.R| \ge c_R$ and 2) the textual overlap
$\sum_{t \in q.T \cap n.T} w(t) \ge c_T$, where $c_R$ and $c_T$ are thresholds derived
from $\tau_R$ and $\tau_T$, which will be discussed in
Sections 4 and 3, respectively. For a leaf node corresponding to an
object, we verify whether its spatial and textual similarities
to $q$ are respectively not smaller than $\tau_R$ and $\tau_T$.

The algorithm may visit too many unnecessary nodes and
lead to low search efficiency. We explain it by taking the
objects in Figure 1 as an example. Using a maximum fan-out
of 3, we can construct an IR-tree as shown in Figure 2,
where leaf nodes correspond to objects and intermediate nodes
(i.e., $R_8$, $R_9$ and $R_{10}$) are MBRs bounding the objects.
For simplicity, we use $R$ to represent both tree nodes and
their regions. Given query $q$ in Figure 1, the algorithm has
to visit nodes $R_8$, $R_9$ and $R_{10}$ and verify their leaf nodes to
report the answer $\{o_2\}$. Obviously, it is unnecessary to visit
the subtrees rooted at $R_9$ and $R_{10}$ as none of the five leaf
nodes is similar to $q$. Therefore, the IR-tree-based methods
have poor filtering power due to the hierarchical structure of
the IR-tree. Moreover, the IR-tree maintains a tree-based
inverted index in each node for mapping tokens to its child
nodes. Let $\mathcal{H}$ denote the height of the IR-tree. In the worst
case, each token of every object needs to be indexed $\mathcal{H}$ times,
and this results in high space complexity. To address these
problems, in this paper we propose a novel method Seal.

3. THE SEAL METHOD 

In this section, we first introduce a filter-and-verification
framework in Section 3.1, and then present a textual-based
filtering algorithm in Section 3.2.

3.1 A Filter-and-Verification Framework

To answer a spatio-textual similarity search query efficiently,
we want to prune dissimilar objects and only visit a
small number of objects that may be similar to the query.
To this end, we propose a filter-and-verification framework.

Step 1 - Filter: We prune a large number of objects which
cannot be similar to query $q$, and find a candidate set $C$,
which is a superset of the answer set $A$.
Step 2 - Verification: We verify the candidates generated
in the filter step by checking whether spatial and textual
similarities of each candidate are respectively not smaller
than thresholds $\tau_R$ and $\tau_T$, and return the answer set $A$.

In this paper, we focus on the filter step and propose
efficient signature-based filtering algorithms. Consider a
spatio-textual object $o \in O$ and a query $q$. We denote their
signatures as $S(o)$ and $S(q)$, where each $s \in S(o)$ (or $S(q)$)
is called a signature element (or element for simplicity). A
signature method must satisfy the following property:

$o$ is similar to $q$ only if $S(o)$ and $S(q)$ are similar.

Specifically, given a signature similarity function $\mathrm{sim}(\cdot)$
and a threshold $c$ deduced from thresholds $\tau_R$ and $\tau_T$, the
object $o$ is similar to $q$ only if $\mathrm{sim}(S(q), S(o)) \ge c$.

A naive method for generating candidates enumerates the
signature of each object $o$ and checks if $\mathrm{sim}(S(q), S(o)) \ge c$.
If so, we add $o$ into candidate set $C$. However, this method
is rather expensive and we want to build indexes to support
efficient filtering, which will be discussed later.

Next, we introduce our algorithm SealSig; Figure 3
illustrates the pseudo-code. We first generate the signatures
for objects in $O$ and build an index on top of the signatures
(line 2). Then, for a query $q$, we use the index to find its
candidate set $C$ (line 3). Finally, we verify the candidates in
$C$ and return the answer set $A$ (line 4).

In this paper we focus on generating effective signatures, 
building indexes and devising efficient filtering algorithms. 

3.2 A Textual-Based Filtering Algorithm 

We introduce a textual-based filtering algorithm. 

Textual Signatures. Since an object $o$ similar to query $q$
must satisfy $\mathrm{sim}_T(q, o) \ge \tau_T$, we can use the tokens in $o.T$ as its
textual signature, i.e., $S_T(o) = o.T$, where each token $t \in o.T$
is a signature element. Similarly, we can also generate a
textual signature for $q$, i.e., $S_T(q) = q.T$.



Algorithm 1: SealSig (O, q)

Input: O: An object set; q: A query
Output: A: Answers of q
1 begin
2   Generate signatures for O and build index I;
3   C = Sig-Filter (q, I);
4   A = Sig-Verify (q, C);
5 end

Function Sig-Filter (q, I)

Input: q: A query; I: An inverted index
Output: C: Candidate objects
1 begin
2   Initialize a candidate set C ← ∅;
3   Generate signature, S(q) ← GenSig (q);
4   Compute signature similarity threshold c;
5   for each element s in S(q) do
6     Obtain objects in inverted list I(s);
7     Merge objects with sim(S(q), S(o)) ≥ c to C;
8 end

Function Sig-Verify (q, C)

Input: q: A query; C: A set of candidate objects
Output: A: Answers of q
1 begin
2   for each object o ∈ C do
3     if sim_R(q, o) ≥ τ_R and sim_T(q, o) ≥ τ_T then
        A ← A ∪ {o};
4 end

Figure 3: A Filter-and-Verification Framework

Then, we define the similarity between signatures $S_T(o)$
and $S_T(q)$ as the weight summation of their common tokens,
i.e., $\mathrm{sim}(S_T(q), S_T(o)) = \sum_{t \in S_T(q) \cap S_T(o)} w(t)$, and threshold
$c_T = \tau_T \cdot \sum_{t \in S_T(q)} w(t)$. It is easy to prove that $\mathrm{sim}_T(q, o) \ge \tau_T$
only if $\sum_{t \in S_T(q) \cap S_T(o)} w(t) \ge c_T$. For example, we can
respectively generate textual signatures for $o_1$ and $q$ as $S_T(o_1) = \{t_1, t_2\}$
and $S_T(q) = \{t_1, t_2, t_3\}$. Given $\tau_T = 0.3$, the threshold
$c_T$ can be computed as 0.57. Obviously, since $\mathrm{sim}_T(q, o_1) \ge \tau_T$,
we have $\sum_{t \in S_T(q) \cap S_T(o_1)} w(t) \ge c_T$.

Indexing Structures. To avoid enumerating every object
$o \in O$ for computing $\mathrm{sim}(S(q), S(o))$, we build an inverted
index on top of the signatures. Formally, an inverted index
$\mathcal{I}$ consists of a set of inverted lists, each of which maps an
element $s$ to the objects containing the element, denoted by
$\mathcal{I}(s)$. Figure 4 provides the inverted index of the objects in
Figure 1. For example, since the signatures of objects $o_3$ and
$o_6$ contain element $t_4$, the inverted list of $t_4$ is $\{o_3, o_6\}$. In
addition, since $o_2$ has textual signature $S_T(o_2) = \{t_1, t_3, t_2\}$,
the object is contained in the inverted lists of $t_1$, $t_3$ and $t_2$.

Filtering Algorithm. Given a query $q$ with thresholds
$\tau_R$ and $\tau_T$, the filtering algorithm Sig-Filter in Figure 3
is utilized to filter dissimilar objects and find the candidates
with signatures similar to $S(q)$, i.e.,
$C = \{o \mid o \in O, \mathrm{sim}(S(q), S(o)) \ge c\}$. Specifically, the algorithm generates
a signature for $q$, i.e., $S(q) \leftarrow \mathrm{GenSig}(q)$, and computes
threshold $c$. For each element $s \in S(q)$, Sig-Filter probes
inverted list $\mathcal{I}(s)$ from the inverted index $\mathcal{I}$, and merges the
objects in $\mathcal{I}(s)$ satisfying $\mathrm{sim}(S(q), S(o)) \ge c$ to $C$.

Figure 4: Textual signature based method.

Example 2. We use an example to illustrate how algorithm
SealSig works using textual signatures, as shown in
Figure 4. Consider the objects and query $q$ with thresholds
$\tau_R = 0.25$ and $\tau_T = 0.3$ in Figure 1. The algorithm generates
the textual signature $S_T(o)$ for every object $o \in O$, and then
builds the inverted index on top of the signatures. Given
query $q$, the algorithm Sig-Filter generates its textual signature
$S_T(q) = \{t_1, t_2, t_3\}$ and computes the threshold $c_T = 0.57$.
Then, it probes the inverted lists of $t_1$, $t_3$ and $t_2$, and
finds the candidate objects satisfying $\mathrm{sim}(S(q), S(o)) \ge c_T$,
i.e., $C = \{o_1, o_2, o_3, o_4, o_5\}$. Finally, Sig-Verify verifies the
candidates and reports the answer $A = \{o_2\}$.
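The filter-and-verification flow with textual signatures can be sketched as follows. This is a simplified illustration rather than the SealSig implementation: the token weights and most objects are hypothetical (only the token sets and spatial similarities of o1 and o2 follow values quoted in the text), and spatial similarity to the query is assumed to be precomputed so that the sketch can focus on the inverted index and the threshold $c_T$.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical token weights and objects: id -> (tokens, precomputed sim_R to the query).
W: Dict[str, float] = {"t1": 0.5, "t2": 0.6, "t3": 0.8, "t4": 1.2, "t5": 1.5}
OBJECTS: Dict[str, Tuple[Set[str], float]] = {
    "o1": ({"t1", "t2"}, 0.23),
    "o2": ({"t1", "t2", "t3"}, 0.32),
    "o3": ({"t3", "t4"}, 0.40),
    "o6": ({"t4", "t5"}, 0.50),
    "o7": ({"t1"}, 0.60),
}


def build_index(objects: Dict[str, Tuple[Set[str], float]]) -> Dict[str, List[str]]:
    """Inverted index: token -> ids of the objects whose signature contains it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for oid, (tokens, _) in objects.items():
        for t in tokens:
            index[t].append(oid)
    return index


def sig_filter(q_tokens: Set[str], tau_t: float, index: Dict[str, List[str]]) -> Set[str]:
    """Keep objects whose shared-token weight reaches c_T = tau_T * (total query weight)."""
    c_t = tau_t * sum(W[t] for t in q_tokens)
    shared: Dict[str, float] = defaultdict(float)
    for t in q_tokens:
        for oid in index.get(t, []):
            shared[oid] += W[t]
    return {oid for oid, s in shared.items() if s >= c_t}


def sig_verify(q_tokens: Set[str], candidates: Set[str], tau_r: float, tau_t: float) -> Set[str]:
    """Check the exact spatial and textual similarities of each candidate."""
    answers = set()
    for oid in candidates:
        tokens, sim_r = OBJECTS[oid]
        sim_t = sum(W[t] for t in q_tokens & tokens) / sum(W[t] for t in q_tokens | tokens)
        if sim_r >= tau_r and sim_t >= tau_t:
            answers.add(oid)
    return answers


index = build_index(OBJECTS)
query = {"t1", "t2", "t3"}
candidates = sig_filter(query, 0.3, index)
print(sorted(candidates))                                # ['o1', 'o2', 'o3']
print(sorted(sig_verify(query, candidates, 0.25, 0.3)))  # ['o2']
```

In this toy run, o7 shares a token with the query but is dropped by the threshold, o6 is never touched because it appears in no probed list, and o1 and o3 survive the filter but fail verification.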

The algorithm SealSig using textual signatures has a
limitation that it fails to consider the spatial information.
Recall that an object $o$ is similar to query $q$ if and only
if textual similarity $\mathrm{sim}_T(q, o) \ge \tau_T$ and spatial similarity
$\mathrm{sim}_R(q, o) \ge \tau_R$. Obviously, it is rather limited to only consider
the textual information. For example, consider object
$o_4$ in Figure 1. Although the textual similarity is larger than
$\tau_T$, $o_4$ is not similar to $q$ since its region $R_4$ is dissimilar to
$R_q$. To address this problem, we propose to generate spatial
signatures for objects and queries, and devise efficient filtering
algorithms using spatial signatures for spatial pruning
in Section 4. In order to utilize spatial and textual pruning
simultaneously, we develop more efficient hybrid filtering
algorithms in Section 5.

4. GRID-BASED FILTERING ALGORITHM 

In this section, we propose a grid-based filtering algorithm.
We first define the grid-based signature in Section 4.1,
and then devise a threshold-aware pruning technique in Section 4.2.
Finally, we discuss how to select grid granularity
to achieve better performance in Section 4.3.

4.1 Grid-Based Signatures 

Different from textual signatures, as the spatial information
has no inherent elements (e.g., tokens), it is not straightforward
to generate spatial signatures for objects in $O$ and
query $q$. To address this challenge, we propose to partition
the entire space $\mathcal{R}$, which is the MBR of the regions of all
objects in $O$, and generate a set of grids. Then, for an object
$o$, we use the grids intersecting with region $o.R$ as its spatial
signature. More formally, let $G = \{g_1, g_2, \ldots, g_{|G|}\}$ be a set
of grids in space $\mathcal{R}$, where each grid is also an MBR, and
the grids have the following properties.

1) Completeness: All grids cover the space, i.e., $\bigcup_{g \in G} g = \mathcal{R}$.

2) Disjointness: Each pair of different grids is disjoint,
i.e., $\forall i, j$, if $i \ne j$, $g_i \cap g_j = \emptyset$.

Figure 5: Threshold-aware pruning.

We only consider the uniform grids with equal size, i.e.,
$|g_i| = |g_j|$ $(i \ne j)$ in this section, and will extend our techniques
to support grids with different sizes in Section 5. Figure 1
provides an example of uniform grids $G = \{g_1, \ldots, g_{16}\}$
obtained by a $4 \times 4$ partition of the space $\mathcal{R}$ for regions
$\{R_1, R_2, \ldots, R_7\}$. Then, we define the grid-based signature
of object $o$ (query $q$), denoted by $S_R(o)$ ($S_R(q)$), as the grids
intersecting with region $o.R$ ($q.R$) in Definition 4.

Definition 4 (Grid-Based Signature). Given object
$o$, the grid-based signature of $o$ is the grids in $G$ intersecting
with region $o.R$, i.e., $S_R(o) = \{g \mid g \in G, g \cap o.R \ne \emptyset\}$.

For example, Figure 5 shows the grid-based signature of
object $o_2$, $S_R(o_2) = \{g_9, g_{10}, g_{11}, g_{13}, g_{14}, g_{15}\}$.
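A grid-based signature over a uniform $n \times n$ partition can be computed directly from the MBR corners, as in the sketch below. Several details are assumptions made for the illustration: grids are numbered row by row starting from the bottom-left cell, a grid counts as intersecting only if the overlap has positive area, and the space and region coordinates are invented so that the result coincides with the signature of $o_2$ quoted above.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2): bottom-left and top-right


def grid_signature(region: Rect, space: Rect, n: int) -> List[int]:
    """Ids (1-based, row-major) of the cells of an n x n uniform grid that overlap region."""
    sx1, sy1, sx2, sy2 = space
    cw, ch = (sx2 - sx1) / n, (sy2 - sy1) / n        # cell width and height
    x1, y1, x2, y2 = region
    col_lo = max(0, math.floor((x1 - sx1) / cw))
    col_hi = min(n - 1, math.ceil((x2 - sx1) / cw) - 1)
    row_lo = max(0, math.floor((y1 - sy1) / ch))
    row_hi = min(n - 1, math.ceil((y2 - sy1) / ch) - 1)
    return [row * n + col + 1
            for row in range(row_lo, row_hi + 1)
            for col in range(col_lo, col_hi + 1)]


# Hypothetical 100 x 100 space split 4 x 4, and a hypothetical region for o2.
print(grid_signature((10, 60, 70, 95), (0, 0, 100, 100), 4))  # [9, 10, 11, 13, 14, 15]
```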

Next, we define similarity between grid-based signatures
$S_R(q)$ and $S_R(o)$ as weight summation of their common grids,
i.e., $\mathrm{sim}(S_R(q), S_R(o)) = \sum_{g \in S_R(q) \cap S_R(o)} w(g \mid q, o)$, where
$w(g \mid q, o)$ is the weight of grid $g$ with respect to query $q$
and object $o$. Intuitively, the weight captures the degree of
spatial similarity between $q$ and $o$ contributed by $g$.

It is not straightforward to define grid weight due to the
following reason. Ideally, if grid $g$ is shared by $q$ and $o$, the
weight should be the area of the intersecting region of $q$ and
$o$ in the grid, i.e., $w(g \mid q, o) = |q.R \cap o.R \cap g|$. However, in
practice, it is expensive to compute $|q.R \cap o.R \cap g|$ at
query time. Thus, we propose to employ an upper bound of
$|q.R \cap o.R \cap g|$ to estimate the weight as

$$w(g \mid q, o) = \min\{w(g \mid q), w(g \mid o)\}, \qquad (1)$$

where $w(g \mid o)$ ($w(g \mid q)$) is the weight of grid $g$ with respect
to object $o$ (or query $q$), and can be estimated by $w(g \mid o) = |g \cap o.R|$
or $w(g \mid q) = |g \cap q.R|$.

In addition, we define the threshold $c_R$ to be the area of
region $q.R$ multiplied by threshold $\tau_R$, i.e., $c_R = \tau_R \cdot |q.R|$. Now,
we can prove that the grid-based signature satisfies the key
property of signatures, as shown in Lemma 1.

Lemma 1. Spatial similarity satisfies $\mathrm{sim}_R(q, o) \ge \tau_R$ only
if $\mathrm{sim}(S_R(q), S_R(o)) \ge c_R$.

Proof. The proofs are in our technical report [8]. □
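To make the grid weights and the threshold $c_R$ concrete, the sketch below computes $w(g \mid q)$ and $w(g \mid o)$ on a uniform grid, sums the per-grid minimum of Equation (1), and compares the result with the exact overlap. The space, the two regions, and all function names are hypothetical; the query region is merely given area 2400 so that $c_R = 0.25 \cdot 2400 = 600$, as in the running example.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def inter_area(a: Rect, b: Rect) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def grid_cells(space: Rect, n: int) -> Dict[int, Rect]:
    """Uniform n x n partition of space; ids are 1-based, row-major from the bottom-left."""
    x0, y0, x1, y1 = space
    cw, ch = (x1 - x0) / n, (y1 - y0) / n
    return {r * n + c + 1: (x0 + c * cw, y0 + r * ch, x0 + (c + 1) * cw, y0 + (r + 1) * ch)
            for r in range(n) for c in range(n)}


def grid_weights(region: Rect, cells: Dict[int, Rect]) -> Dict[int, float]:
    """w(g | region) = |g ∩ region| for every grid with positive overlap."""
    weights = {g: inter_area(region, cell) for g, cell in cells.items()}
    return {g: w for g, w in weights.items() if w > 0}


def signature_similarity(wq: Dict[int, float], wo: Dict[int, float]) -> float:
    """Sum over common grids of min{w(g | q), w(g | o)}, an upper bound of |q.R ∩ o.R|."""
    return sum(min(wq[g], wo[g]) for g in wq.keys() & wo.keys())


cells = grid_cells((0, 0, 100, 100), 4)
q, o = (30, 30, 90, 70), (10, 60, 70, 95)
sim = signature_similarity(grid_weights(q, cells), grid_weights(o, cells))
c_r = 0.25 * inter_area(q, q)          # tau_R * |q.R|
print(sim, inter_area(q, o), c_r)
```

With these made-up regions the bound evaluates to 675 while the exact overlap is 400, so the object passes the filter ($675 \ge 600$) even though its true Jaccard similarity is below 0.25. This is the expected behavior of a necessary-condition filter: such false positives are removed in the verification step.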

For example, in Figure 5, based on the grids $\{g_1, \ldots, g_{16}\}$,
we generate grid-based signatures, $S_R(q)$ and $S_R(o_2)$, for
query $q$ and object $o_2$. We also compute grid weights, e.g.,
$w(g_{10} \mid q) = 750$, $w(g_{10} \mid o_2) = 450$. Thus, we obtain signature
similarity $\mathrm{sim}(S_R(q), S_R(o_2)) = 1375$. Obviously, we
have $\mathrm{sim}(S_R(q), S_R(o_2)) \ge \tau_R \cdot |q.R| = 600$.

4.2 Threshold-Aware Pruning

Based on grid-based signatures, a straightforward method
to filter dissimilar objects is to simply employ the algorithm
Sig-Filter in Figure 3. The algorithm probes inverted lists
of grids in signature $S_R(q)$, and inserts the objects satisfying
$\mathrm{sim}(S_R(q), S_R(o)) \ge c_R$ into candidate set $C$. However, algorithm
Sig-Filter has the following limitations. Firstly, for
a large query region $q.R$, since it may generate many grids
as the signature for $q$, the algorithm needs to probe many inverted
lists, which may be very expensive. Secondly, when
probing the inverted list of a grid, we have to retrieve all
objects in the list, resulting in high probing costs for long
inverted lists.

To address the above-mentioned limitations, we devise a
threshold-aware pruning technique in this section. The objective
of our technique is two-fold: 1) to reduce the number
of probed inverted lists; and 2) to reduce the number of objects
retrieved from a probed inverted list. To this end, we
employ the prefix filtering [5] mentioned in Section 2.2.

To use prefix filtering, we first fix a global order on the
generated signature elements of the objects in $O$. Then, we
sort the elements of every object based on the global order.
In the filter step, for each object $o$, instead of using signature
$S(o)$, we only select a prefix of the signature, denoted
by $S^p(o)$. Similarly, we also select a signature prefix for
the query, $S^p(q)$. The signature prefixes must satisfy that
$\sum_{s \in S(q) \cap S(o)} w(s) < c$ if $S^p(q) \cap S^p(o) = \emptyset$.

Prefix Selection. Given a global order of signature elements,
consider signature $S(o) = \{s_1, s_2, \ldots, s_{|S(o)|}\}$ of object
$o$, where $s_i$ is the $i$-th element based on the global order.
To select prefix $S^p(o) = \{s_1, \ldots, s_p\}$, we can remove the last
elements with weight summation smaller than $c$ and select
the remaining elements as the prefix, as shown in Lemma 2.

Lemma 2. Given a similarity threshold $c$, the signature
prefix $S^p(o) = \{s_1, \ldots, s_p\}$ can be selected as

$$p = \min\{i\}, \quad \text{s.t.} \quad \sum_{j=i+1}^{|S(o)|} w(s_j) < c. \qquad (2)$$

Grid Order. In this paper, we sort the grids in ascending
order of the number of the object regions intersecting with
them⁴. Formally, let $\mathrm{count}(g)$ denote the number of object
regions in $O$ intersecting with grid $g$, i.e., $\mathrm{count}(g) = |\{R \mid R \cap g \ne \emptyset\}|$.
Then, we sort all grids in $G$ in ascending order
of $\mathrm{count}(g)$. For example, in Figure 5 we sort the grids
intersecting with $q.R$ as $\{g_7, g_{10}, g_{11}, g_{14}, g_{15}, g_6\}$. As observed
from this figure, we have $c_R = 600$, $w(g_{15} \mid q) = 300$ and
$w(g_6 \mid q) = 250$. According to Lemma 2, we can select the
signature prefix as $\{g_7, g_{10}, g_{11}, g_{14}\}$, i.e., $p = 4$, because any
shorter prefix ($p < 4$) may cause the reduction of weights
larger than threshold $c_R$, i.e., $\sum_{j=p+1}^{|S(q)|} w(g_j \mid q) \ge c_R$.

Therefore, instead of considering all elements in $S(q)$, we
only need to probe the inverted lists of the ones in $S^p(q)$,
which can reduce the filtering complexity.
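Prefix selection according to Lemma 2 amounts to dropping elements from the tail of the ordered signature while the dropped weight stays below $c$. The following sketch illustrates this. The function name is an assumption, and so are the weights of $g_7$, $g_{11}$ and $g_{14}$ (the text only gives $w(g_{10} \mid q) = 750$, $w(g_{15} \mid q) = 300$ and $w(g_6 \mid q) = 250$); they are chosen so that all weights sum to a query area of 2400.

```python
from typing import List, Sequence, Tuple


def select_prefix_length(weights: Sequence[float], c: float) -> int:
    """Smallest p such that the weights after position p (1-based) sum to less than c."""
    p = len(weights)
    dropped = 0.0                       # total weight of the elements removed so far
    while p > 0 and dropped + weights[p - 1] < c:
        dropped += weights[p - 1]
        p -= 1
    return p


# Query signature in the global grid order; weights of g7, g11, g14 are hypothetical.
signature: List[Tuple[str, float]] = [
    ("g7", 500), ("g10", 750), ("g11", 300), ("g14", 300), ("g15", 300), ("g6", 250),
]
p = select_prefix_length([w for _, w in signature], c=600)
print(p, [g for g, _ in signature[:p]])  # 4 ['g7', 'g10', 'g11', 'g14']
```

Here the last two grids carry a total weight of 550, which is below 600, whereas also dropping $g_{14}$ would remove 850, so the prefix keeps the first four grids.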

Inverted Index with Threshold Bounds. To further
improve the performance, we propose to reduce the number
of retrieved objects in a probed inverted list. Recall that, for
an object $o$, we only need to consider the signature elements
in prefix $S^p(o)$ rather than $S(o)$. Thus, when probing the inverted
list $\mathcal{I}(s)$ of element $s \in S(o)$, we only need to retrieve
the objects in $\mathcal{I}(s)$ containing $s$ in their signature prefixes
given threshold $c$, i.e., $\mathcal{I}^c(s) = \{o \mid o \in \mathcal{I}(s), s \in S^p(o)\}$.

Function Sig-Filter+ (q, I)

Input: q: A query; I: An inverted index
Output: C: Candidate objects
1 begin
2   Initialize candidate set C ← ∅;
3   Generate signature, S(q) ← GenSig (q);
4   Compute signature similarity threshold c;
5   Select prefix S^p(q) for query q;
6   for each element s in S^p(q) do
7     Find the objects I^c(s) from inverted list I(s);
8     C ← C ∪ I^c(s);
9 end

Figure 6: Filtering with threshold-aware pruning

A challenging problem is to efficiently compute $\mathcal{I}^c(s)$ based
on various thresholds $c$ for different queries. To address this
problem, we augment a threshold bound to each object in every
inverted list. Specifically, for an object $o$ in an inverted
list $\mathcal{I}(s)$ of element $s$, we maintain a threshold upper bound
$c_s(o)$, which represents the maximum threshold for which we keep
$s$ in $o$'s signature prefix. Thus, if threshold $c > c_s(o)$,
object $o$ can be pruned from $\mathcal{I}^c(s)$, as shown below.

Lemma 3. Let $s_i$ be the i-th signature element in $S(o)$ (where $1 \leq i \leq |S(o)|$) and c be a signature similarity threshold. The object o can be pruned from $I^c(s_i)$ if

$$c > c_{s_i}(o) = \sum_{j=i}^{|S(o)|} w(s_j). \quad (3)$$



*Note that the global grid order influences the performance of filtering. We do not study this problem in this paper due to space limitations, and leave it as future work.



We store bound $c_s(o)$ for each object o in inverted list $I(s)$, and sort the objects in descending order of the bounds. Thus, given a threshold c, we can efficiently find $I^c(s) = \{o \mid o \in I(s), c_s(o) \geq c\}$ when probing the inverted list of element s. Figure 5 provides the inverted lists of eight grids. For example, the inverted list of grid $g_{14}$ contains objects $\{o_1, o_2\}$, each of which is associated with a threshold bound, e.g., $c_{g_{14}}(o_1) = 900$. Given threshold $c_R = 600$, we only retrieve $o_1$ when probing the inverted list of $g_{14}$, because bound $c_{g_{14}}(o_2) = 550 < c_R$.
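The threshold bounds of Lemma 3 and the early-terminating probe can be sketched as follows. This is an illustrative Python sketch, not the paper's code: the names `threshold_bounds`, `build_index` and `probe` are hypothetical, and the per-grid weights of the two objects are invented so that the bounds for $g_{14}$ come out as the 900 and 550 quoted above.

```python
def threshold_bounds(weights):
    """Bound of the i-th element = sum of the weights from position i to the end."""
    bounds, suffix = [], 0
    for w in reversed(weights):
        suffix += w
        bounds.append(suffix)
    bounds.reverse()
    return bounds


def build_index(signatures):
    """signatures: {object_id: [(element, weight), ...]} sorted by the global order.

    Returns {element: [(bound, object_id), ...]} with each list sorted by
    descending bound, so a probe can stop at the first bound below c.
    """
    index = {}
    for oid, sig in signatures.items():
        bounds = threshold_bounds([w for _, w in sig])
        for (elem, _), bound in zip(sig, bounds):
            index.setdefault(elem, []).append((bound, oid))
    for postings in index.values():
        postings.sort(key=lambda posting: -posting[0])
    return index


def probe(index, elem, c):
    """Return the objects of I^c(elem): those whose bound is at least c."""
    result = []
    for bound, oid in index.get(elem, []):
        if bound < c:
            break  # the list is sorted, so every later bound is smaller too
        result.append(oid)
    return result


# Hypothetical weights chosen to reproduce the bounds 900 and 550 for g14.
signatures = {
    "o1": [("g14", 500), ("g15", 400)],
    "o2": [("g14", 350), ("g15", 200)],
}
index = build_index(signatures)
print(probe(index, "g14", 600))  # prints ['o1']
```

Sorting each list by descending bound is what lets a single index serve queries with different thresholds: the scan cost is proportional to the number of objects actually returned.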

Threshold-Aware Pruning. Based on the techniques mentioned above, we devise an improved filtering algorithm Sig-Filter+ in Figure 6. Compared with algorithm Sig-Filter, Sig-Filter+ only selects the signature prefix $S^c(q)$ for query q. For each element $s \in S^c(q)$, it only retrieves the objects in $I^c(s)$ instead of $I(s)$, and merges them into the candidate set C. We use the following example to illustrate how algorithm Sig-Filter+ works using grid-based signatures.

Example 3. Consider the objects O and query q with thresholds $\tau_R = 0.25$ and $\tau_T = 0.3$ in Figure 1. We generate grid-based signatures and build the inverted index with threshold bounds as shown in Figure 5. Given query q, we first generate its signature prefix, $S_R^{c_R}(q) = \{g_7, g_{10}, g_{11}, g_{14}\}$, based on threshold $c_R = 600$ according to Lemma 2. Then, for each element in $S_R^{c_R}(q)$, we probe its inverted list and only retrieve the objects with bounds no smaller than $c_R$, and obtain the candidates $C_R = \{o_1, o_2, o_5, o_7\}$. Finally, algorithm Sig-Verify reports the answer of q, i.e., $A = \{o_2\}$.

Algorithm Sig-Filter+ can also be applied to textual signatures. Specifically, we can sort tokens in descending order of their idfs, and build the inverted index with threshold bounds. Given query q, we only consider the tokens in the signature prefix, denoted by $S_T^{c_T}(q)$, and only retrieve the objects with threshold bounds no smaller than $c_T$ in each inverted
list. For example, we only retrieve the inverted lists of $t_1$ and $t_3$ in Figure 4, and obtain candidates $C_T = \{o_1, o_2, o_3, o_4, o_5\}$.
Notice that algorithm Sig-Filter+ may produce candidate sets of different sizes when using different signatures.
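Putting prefix selection and bounded probing together, the loop of Figure 6 can be sketched as below. This is a hedged Python sketch rather than the authors' implementation: the function name `sig_filter_plus` and the index layout are assumptions, and the toy index is invented so that the run mirrors Example 3 (only the bounds 900 and 550 in the list of $g_{14}$, the weights 300 and 250, and the threshold 600 come from the text).

```python
def sig_filter_plus(query_sig, c, index):
    """Sketch of Sig-Filter+.

    query_sig: [(element, weight), ...] of the query, in the global order.
    index:     {element: [(bound, object_id), ...]} sorted by descending bound.
    Returns (candidates, retrieved), where `retrieved` counts the postings read.
    """
    # Prefix selection (Lemma 2): drop trailing elements while the dropped
    # weight stays below c.
    p, dropped = len(query_sig), 0
    while p > 0 and dropped + query_sig[p - 1][1] < c:
        dropped += query_sig[p - 1][1]
        p -= 1

    candidates, retrieved = set(), 0
    for elem, _ in query_sig[:p]:
        for bound, oid in index.get(elem, []):
            if bound < c:
                break  # threshold-aware pruning (Lemma 3)
            candidates.add(oid)
            retrieved += 1
    return candidates, retrieved


# Hypothetical data shaped after Example 3.
query_sig = [("g7", 100), ("g10", 150), ("g11", 200),
             ("g14", 400), ("g15", 300), ("g6", 250)]
index = {
    "g7": [(700, "o7")],
    "g10": [(800, "o1"), (650, "o2")],
    "g11": [(900, "o5"), (600, "o1")],
    "g14": [(900, "o1"), (550, "o2")],
    "g15": [(900, "o3")],
}
candidates, retrieved = sig_filter_plus(query_sig, 600, index)
print(sorted(candidates), retrieved)  # prints ['o1', 'o2', 'o5', 'o7'] 6
```

In this toy run the list of $g_{15}$ is never probed because $g_{15}$ falls outside the query prefix, and $o_2$ is skipped in the list of $g_{14}$ because its bound is below the threshold.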

4.3 Grid Granularity Selection 

An essential task in algorithm Sig-Filter+ is to generate grid-based signatures, i.e., GenSig. The performance of Sig-Filter+ is affected by grid granularity, and a key challenge is to select an appropriate grid granularity. More specifically, a coarse granularity makes filtering fast but weakens the filtering power and thus slows down verification. In contrast, a fine granularity reduces the number of candidates, but slows down filtering. For instance, in Example 3, $o_5$ is a candidate of query q as $o_5.R$ and $q.R$ share grid $g_{15}$. However, the two regions do not intersect with each other at all, and thus $o_5$ could actually be pruned.

To alleviate the problem, we propose a method for selecting grid granularity in this section. We introduce a probabilistic model to measure the expected query cost for grids of a specific granularity. To answer a query q, the overall cost $cost(q)$ consists of filtering cost $cost_F(q)$ and verification cost $cost_V(q)$, i.e., $cost(q) = cost_F(q) + cost_V(q)$. The filtering cost $cost_F(q)$ depends on the number of objects retrieved from the inverted index, i.e., $cost_F(q) = \pi_1 \cdot \sum_{g \in S^c(q)} |I^c(g)|$, where $\pi_1$ is the average cost of retrieving an object from an inverted list and merging it into the candidates. On the other hand, verification cost $cost_V(q)$ depends on the number of candidates, i.e., $cost_V(q) = \pi_2 \cdot |C|$, where $\pi_2$ is the average cost of verifying an object. Thus, we have $cost(q) = \pi_1 \cdot \sum_{g \in S^c(q)} |I^c(g)| + \pi_2 \cdot |C|$. For example, we have $cost(q) = 6\pi_1 + 4\pi_2$ for query q in Figure 5.

Notice that the above analysis is based on a single query. To analyze the expected query cost of grid set G with a specific granularity, we suppose that we have a query workload Q. Then, each grid $g \in G$ has a probability $P(g)$ representing the likelihood that g is used by queries. In addition, when inverted list $I(g)$ is probed by different queries, the returned objects $I^c(g)$ may be different. For ease of analysis, we consider the worst case that all objects need to be returned, i.e., $|I^c(g)| = |I(g)|$. Thus, we can estimate the expected query cost of all grids in G with the specific granularity as

$$cost(G) = \pi_1 \cdot \sum_{g \in G} P(g) \cdot |I(g)| + \pi_2 \cdot |C|, \quad (4)$$

where $|C|$ is the average size of candidates given the query workload. Now, we can define the grid granularity selection problem: Find the set of grids G with a specific granularity that minimizes the expected cost $cost(G)$.
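The two cost formulas are simple enough to check numerically. The Python sketch below is only an illustration: the function names and the unit costs ($\pi_1 = 1$, $\pi_2 = 10$) are assumptions chosen for the example, and the grid statistics passed to `expected_cost` are made up.

```python
def query_cost(num_retrieved, num_candidates, pi1, pi2):
    """cost(q) = pi1 * (objects retrieved while filtering) + pi2 * |C|."""
    return pi1 * num_retrieved + pi2 * num_candidates


def expected_cost(grid_stats, avg_candidates, pi1, pi2):
    """Equation (4): grid_stats is an iterable of (P(g), |I(g)|) pairs."""
    filtering = sum(prob * size for prob, size in grid_stats)
    return pi1 * filtering + pi2 * avg_candidates


# The query of Figure 5 retrieves 6 objects and yields 4 candidates,
# i.e. cost(q) = 6 * pi1 + 4 * pi2.
print(query_cost(6, 4, pi1=1.0, pi2=10.0))  # prints 46.0

# Two hypothetical grids: probed with probability 0.5 and 0.25.
print(expected_cost([(0.5, 4), (0.25, 8)], 3, pi1=1.0, pi2=10.0))  # prints 34.0
```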

Since it is intractable to solve the problem by considering arbitrary grid partition schemes, we devise an approximate algorithm for selecting grid granularity, as illustrated in Figure 7. The basic idea is to decompose the underlying space $\mathcal{R}$ into a grid tree with height H, where the grids in level l, denoted by $G^l$, are obtained by a $2^l \times 2^l$ partition of space $\mathcal{R}$. For example, at level 0, there is a single $1 \times 1$ grid, level 1 partitions $\mathcal{R}$ into $2 \times 2$ grids, and so forth. Therefore, we reduce grid granularity selection to the problem of finding the best level l in the grid tree to minimize the expected cost.

Figure 7: Grid Granularity Selection

We devise an approximate algorithm to solve this problem. We traverse the grid tree from the root to its leaves, and compute the expected cost of each level. For two adjacent levels, l and l + 1 ($0 \leq l < H$), we compute the benefit of the partitioning as $B(l, l+1) = cost(G^l) - cost(G^{l+1})$. If the benefit is smaller than a threshold B, the algorithm terminates, where B > 0 is used to balance efficiency and storage. We can prove that for any B, there is a level $l^*$ such that for any level $l > l^*$ the filtering benefit $B_F(l, l+1) < B$, as formalized in Lemma 4.

Lemma 4. $\forall B > 0$, there exists a level $l^*$ such that for all levels $l > l^*$, the filtering benefit $B_F(l, l+1) < B$.
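The level-by-level traversal can be sketched as a short loop. This Python sketch makes two assumptions that the text does not spell out: the expected costs of all levels are given up front in a list, and when the benefit falls below the threshold the algorithm returns the current (coarser) level. The name `select_level` and the cost values are hypothetical.

```python
def select_level(level_costs, min_benefit):
    """Walk down the grid tree and stop when partitioning stops paying off.

    level_costs[l] is the expected cost of the 2^l x 2^l grid set G^l.
    Returns the level at which the traversal stops.
    """
    level = 0
    while level + 1 < len(level_costs):
        benefit = level_costs[level] - level_costs[level + 1]
        if benefit < min_benefit:
            break  # going one level deeper gains less than the threshold B
        level += 1
    return level


# Hypothetical expected costs for levels 0..4.
print(select_level([100, 60, 40, 35, 38], min_benefit=10))  # prints 2
```

A larger threshold stops the descent earlier and keeps the index smaller, which is the efficiency versus storage trade-off controlled by B.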

We briefly show the correctness of Lemma 4 (the proof is in [8]). Based on Equation (4), we have $B_F(l, l+1) = \sum_{g^l \in G^l} B_F(g^l)$, where $B_F(g^l)$ is the benefit of partitioning grid $g^l$ into fine-grained grids. We use the example of partitioning $g^l$ into $\{g_1^{l+1}, g_2^{l+1}, g_3^{l+1}, g_4^{l+1}\}$ to show how to compute $B_F(g^l)$. Based on probability theory, we have

$$P(g^l) = \sum_{i} P(g_i^{l+1}) - \sum_{i<j} P(g_i^{l+1} g_j^{l+1}) + \sum_{i<j<k} P(g_i^{l+1} g_j^{l+1} g_k^{l+1}) - P(g_1^{l+1} g_2^{l+1} g_3^{l+1} g_4^{l+1}).$$

Since no query region can intersect with three grids, $P(g_i^{l+1} g_j^{l+1} g_k^{l+1}) = 0$. We denote $\sum_{i<j} P(g_i^{l+1} g_j^{l+1}) + P(g_1^{l+1} g_2^{l+1} g_3^{l+1} g_4^{l+1})$ as $\theta$, so that $P(g^l) = \sum_{i} P(g_i^{l+1}) - \theta$. Thus, the benefit of partitioning $g^l$ is $\pi_1 \cdot (\sum_{i} P(g_i^{l+1}) \cdot (|I(g^l)| - |I(g_i^{l+1})|) - \theta \cdot |I(g^l)|)$. With the increase of l, the benefit of partitioning a grid becomes less and less significant, since 1) $|I(g^l)| - |I(g_i^{l+1})|$ eventually decreases, and 2) $\theta$ eventually increases, as it becomes more likely that queries intersect with more grids. Thus, we can find a level $l^*$ such that for levels $l > l^*$ the filtering benefit $B_F(l, l+1) < B$.

Verification benefit $B_V$ has a similar property. However, it is difficult to analyze $B_V$, as estimating the average candidate size $|C|$ is very hard. Thus, we leave the theoretical analysis as future work and only show the experimental results in Section 6.

5. HYBRID FILTERING ALGORITHMS 

In this section, we develop hybrid filtering algorithms to 
simultaneously utilize textual and spatial signatures. We 
first introduce the hash-based hybrid signature and present 
a filtering algorithm based on the signature in Section 5.1, 
and then propose the hierarchical hybrid signature to further 
improve the performance in Section 5.2. 

5.1 Hash-Based Hybrid Signature 

A straightforward method is to apply algorithm Sig-Filter+ separately using textual and grid-based signatures, and compute the intersection of candidate sets $C_T$ and $C_R$. This method, although reducing the number of candidates, may increase the number of objects retrieved from inverted indexes, and thus result in a higher filtering cost.

In order to reduce the filtering cost, we propose the hash-based hybrid signature to efficiently find possibly similar objects. The basic idea is to hash the tokens and grids of the same object into buckets using a hash function, and then use each bucket as a hybrid signature element.

Definition 5 (The Hash-Based Hybrid Signature). For each object o with textual signature $S_T(o)$ and grid-based signature $S_R(o)$, the hash-based hybrid signature of o is defined by $S_H(o) = \{h = (t, g) \mid t \in S_T(o), g \in S_R(o)\}$, where h is a hash value obtained by hashing t and g into a bucket.

Based on hash-based hybrid signatures, we develop a more efficient filtering algorithm Hybrid-Sig-Filter+ in Figure 8. The algorithm generates textual and grid-based signatures $S_T(o)$ and $S_R(o)$ for each object $o \in O$. It hashes the tokens in $S_T(o)$ and the grids in $S_R(o)$ into buckets to generate a hybrid signature $S_H(o)$. Then, the algorithm builds an inverted index I for the hybrid signatures generated from all objects in O. Compared with the inverted lists introduced in Section 4.2, we attach both spatial and textual threshold bounds to each object o in each inverted list of element h, denoted by $c_h^R(o)$ and $c_h^T(o)$. The two bounds can be computed according to Lemma 3, and satisfy that if either $c_T > c_h^T(o)$ or $c_R > c_h^R(o)$, o can be safely pruned from the inverted list of element h. Moreover, to avoid generating too many inverted lists, we introduce a constraint on index size to guarantee that the number of hash buckets is smaller than a given number, which is explained in Section 5.2.

Given query q, the algorithm generates textual and grid-based signatures $S_T(q)$ and $S_R(q)$, and computes signature similarity thresholds $c_T$ and $c_R$. Then, it selects prefixes $S_T^{c_T}(q)$ and $S_R^{c_R}(q)$. Next, for each token $t \in S_T^{c_T}(q)$, the algorithm examines each grid $g \in S_R^{c_R}(q)$, and computes the hash-based signature element $h = (t, g)$. Using element h, the algorithm probes its inverted list $I(h)$ and retrieves the objects whose bounds are no smaller than $c_T$ and $c_R$, denoted by $I^{c_R, c_T}(h) = \{o \in I(h) \mid c_h^R(o) \geq c_R, c_h^T(o) \geq c_T\}$. Finally, the algorithm merges the retrieved objects $I^{c_R, c_T}(h)$ into the candidate set C.

Example 4. Figure 9 shows how algorithm Hybrid-Sig-Filter+ works. For each object, the algorithm hashes its tokens and grids into buckets to generate a hybrid signature. For example, the signature of object $o_1$ consists of elements such as $(t_1, g_{10})$, $(t_1, g_{11})$, $(t_1, g_{14})$, etc. Given query q with
prefixes $S_T^{c_T}(q) = \{t_1, t_3\}$ and $S_R^{c_R}(q) = \{g_7, g_{10}, g_{11}, g_{14}\}$, the
algorithm obtains the hash-based hybrid signatures, and probes the corresponding inverted lists. When probing a list, the algorithm only retrieves the objects satisfying $c_h^T(o) \geq c_T$ and $c_h^R(o) \geq c_R$. For example, the inverted list of element $(t_1, g_{14})$ only returns $o_1$. Finally, the algorithm produces candidates $C = \{o_1, o_2, o_5\}$.
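The hybrid index and the nested probing loop of Hybrid-Sig-Filter+ can be sketched as follows. This is an illustrative Python sketch under several assumptions: the paper does not name a hash function, so CRC32 from the standard `zlib` module is an arbitrary stand-in; the function names, the bucket count and the textual bounds are hypothetical; and each posting is scanned linearly, since one list cannot be sorted by two bounds at once. Only the spatial bounds 900 and 550 and the threshold 600 echo the running example.

```python
import zlib


def bucket(token, grid, num_buckets):
    """Hash a (token, grid) pair into one of num_buckets buckets."""
    return zlib.crc32(f"{token}|{grid}".encode("utf-8")) % num_buckets


def build_hybrid_index(objects, num_buckets):
    """objects: {oid: (token_bounds, grid_bounds)} where token_bounds maps
    token -> textual bound and grid_bounds maps grid -> spatial bound."""
    index = {}
    for oid, (token_bounds, grid_bounds) in objects.items():
        for t, bound_t in token_bounds.items():
            for g, bound_r in grid_bounds.items():
                h = bucket(t, g, num_buckets)
                index.setdefault(h, []).append((oid, bound_t, bound_r))
    return index


def hybrid_filter(token_prefix, grid_prefix, c_t, c_r, index, num_buckets):
    """Probe every (token, grid) bucket of the query prefixes and keep the
    objects whose textual and spatial bounds both reach the thresholds."""
    candidates = set()
    for t in token_prefix:
        for g in grid_prefix:
            for oid, bound_t, bound_r in index.get(bucket(t, g, num_buckets), []):
                if bound_t >= c_t and bound_r >= c_r:
                    candidates.add(oid)
    return candidates


# Hypothetical objects: o2 fails the spatial bound for g14 (550 < 600).
objects = {
    "o1": ({"t1": 0.9}, {"g14": 900}),
    "o2": ({"t1": 0.9}, {"g14": 550}),
}
index = build_hybrid_index(objects, num_buckets=1024)
result = hybrid_filter(["t1", "t3"], ["g7", "g10", "g11", "g14"],
                       c_t=0.5, c_r=600, index=index, num_buckets=1024)
print(result)  # prints {'o1'}
```

Bucket collisions can only add postings to a probed list; they never hide an object, and every posting is still checked against both bounds, so the filter stays free of false negatives.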



5.2 Hierarchical Hybrid Signatures 

Algorithm Hybrid-Sig-Filter+ generates hybrid signatures using grids with a fixed granularity. As different regions may call for grids with different granularities, this method has the following limitations. Generating coarse-grained grids for small regions may weaken the filtering power and introduce more candidates. For example, the element $(t_1, g_{11})$ in Figure 9 involves a dissimilar object $o_5$, as the estimated grid weight $w(g \mid o_5) = 200$ is much larger than the real weight $w(g \mid o_5, q) = 0$. On the other hand, generating fine-grained grids for large regions may involve too many useless signature elements, leading to high costs for both the storage of inverted lists and filtering. In Figure 1, any fine-grained grid covered by $g_{14}$ is useless to region $R_1$, because $g_{14}$ already provides an accurate grid weight for $R_1$.

Function Hybrid-Sig-Filter+ (q, I)
Input: q: A query; I: A hybrid inverted index
Output: C: Candidate objects
begin
    Initialize candidate set C ← ∅ ;
    Generate signatures S_T(q) and S_R(q) ;
    Compute signature similarity thresholds c_T and c_R ;
    Select prefixes S_T^{c_T}(q) and S_R^{c_R}(q) ;
    for each token t in S_T^{c_T}(q) do
        for each grid g in S_R^{c_R}(q) do
            Compute the hybrid signature h = (t, g) ;
            Find object list I^{c_R, c_T}(h) from I ;
            C ← C ∪ I^{c_R, c_T}(h) ;
end

Figure 8: Hybrid signature based filtering algorithm

Figure 9: Example of algorithm Hybrid-Sig-Filter+

In order to address the problem, we propose to judiciously 
select hierarchical grids for each token t given a constraint 
of index sizes (i.e., the maximum number of hybrid signa- 
ture elements), and generate hierarchical hybrid signatures 
to improve the performance. We first formalize the hierar- 
chical hybrid signature selection problem as follows. 

Hierarchical hybrid signature selection. Intuitively, our objective is to select at most $m_t$ hierarchical grids for the objects containing each token t so as to optimize the filtering power. As mentioned in Section 4.3, optimizing the filtering power can be reduced to minimizing filter and verification costs. Since verification is the bottleneck, as shown in Section 6.3, we focus in this section on minimizing the verification cost, i.e., the average size of candidates $|C|$. Because estimating $|C|$ is very difficult, we consider a simplified version of the problem.

Ideally, suppose that we have a set of grids with the finest granularity, such that each finest grid is either totally covered by or disjoint from each object region. Using the finest grids, we can obtain the most compact candidate set, satisfying $|C| = |A|$. Obviously, the number of finest grids would be huge. Therefore, given a number $m_t$, we need to merge the finest grids into at most $m_t$ hierarchical grids. Different grids merged from the finest grids may lead to different upper bounds of grid weights (see Section 4.1) and result in different candidate sizes. To measure the quality of the grids, we introduce the error of each grid. Formally, we define the error as follows. Consider a grid g with inverted list $I(g)$. Under a uniform assumption, we can estimate the expected size of the inverted list as

$$|I(g)| = \sum_{o \in I(g)} \frac{|g \cap o.R|}{|g|},$$

where $\frac{|g \cap o.R|}{|g|}$ is the probability that query region q.R intersects with object region o.R. For example, consider grid $g_{13}$ in Figure 1. We can compute $|I(g_{13})| = (|g_{13} \cap R_1| + |g_{13} \cap R_2|)/|g| = 1.7$.

Figure 10: Hierarchical hybrid signatures of $t_1$

Definition 6 (Error of Grid). Consider grid g covering finest grids $\{g_1^f, \ldots, g_k^f\}$. The error of g, denoted by Error(g), is $\sum_{g_i^f} (|I(g)| - |I(g_i^f)|)^2$.

Based on the errors of grids, we formally define our hierarchical hybrid signature selection (HSS) problem as follows.

Definition 7 (The HSS Problem). Given a token t and a number $m_t \geq 1$, find at most $m_t$ hierarchical grids $G_t$ such that $\sum_{g \in G_t} Error(g)$ is minimized.
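The expected list size and the grid error of Definition 6 can be computed directly from rectangle overlaps. The Python sketch below is illustrative only: rectangles are assumed to be axis-aligned tuples (x1, y1, x2, y2), the helper names are hypothetical, and the geometry of the example (one grid, four finest cells, two regions) is invented rather than taken from Figure 1.

```python
def area(rect):
    """Area of an axis-aligned rectangle (x1, y1, x2, y2); 0 if it is empty."""
    x1, y1, x2, y2 = rect
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Overlap of two rectangles (possibly an empty rectangle)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def expected_size(grid, regions):
    """Estimated |I(grid)|: sum over regions of |grid ∩ R| / |grid|."""
    return sum(area(intersection(grid, r)) for r in regions) / area(grid)


def grid_error(grid, finest_cells, regions):
    """Definition 6: squared differences between the estimate of the grid
    and the estimates of the finest cells it covers."""
    size = expected_size(grid, regions)
    return sum((size - expected_size(cell, regions)) ** 2 for cell in finest_cells)


# Hypothetical layout: a 2x2 grid, its four unit cells, and two regions,
# one covering the whole grid and one covering only its left half.
grid = (0, 0, 2, 2)
cells = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
regions = [(0, 0, 2, 2), (0, 0, 1, 2)]
print(expected_size(grid, regions))      # prints 1.5
print(grid_error(grid, cells, regions))  # prints 1.0
```

Here the coarse grid reports 1.5 expected objects, while its cells truly hold 2 (left) or 1 (right), so each cell contributes 0.25 to the error.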

We can prove that the HSS problem is NP-hard by a re- 
duction from a known NP-hard problem, a rectangular par- 
titioning problem in [15]. 

Theorem 1. The HSS problem is NP-Hard. 

To solve the HSS problem, we propose a greedy algorithm to find the grids with the minimum errors, as shown in Figure 11. Given all objects indexed by a token t, i.e., $I(t)$, the algorithm first constructs a grid tree, and initializes a priority queue Q whose elements are sorted in descending order of their scores. Then, it inserts the root with its grid error as the score. Here, for a node n, the algorithm approximately computes its error based on the child nodes, $Error(n) = \sum_{child(n)} (|I(n)| - |I(child(n))|)^2$. Next, the algorithm traverses the grid tree until the queue is empty as follows. It removes the front element of the queue, denoted by n. If n is a leaf node, the algorithm inserts it into $G_t$. If n is an intermediate node, the algorithm examines whether to split n using its child nodes $N_c$. If the number of grids after the splitting is larger than $m_t$, i.e., $|G_t| + |Q| + |N_c| - 1 > m_t$, the algorithm does not split the node and only inserts n into $G_t$; otherwise, it inserts the child nodes of n into the queue.
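The greedy traversal can be sketched with a max-priority queue. This Python sketch is not the authors' code: the `Node` class, the function name `hss_greedy` and the toy tree are hypothetical, and the budget test is written as `len(selected) + len(queue) + len(children) > m` because the node n has already been removed from the queue at that point (we read the $|Q|$ of the formula above as still counting n, which is an interpretation on our part).

```python
import heapq
from itertools import count


class Node:
    """A grid-tree node; `size` stands for the estimated list size |I(n)|."""

    def __init__(self, name, size, children=()):
        self.name, self.size, self.children = name, size, list(children)

    def error(self):
        # Error(n) approximated from the child nodes, as described in the text.
        return sum((self.size - c.size) ** 2 for c in self.children)


def hss_greedy(root, m):
    """Select at most m grids, always splitting the node with the largest error."""
    selected, tie = [], count()
    # heapq is a min-heap, so errors are negated; `tie` avoids comparing nodes.
    queue = [(-root.error(), next(tie), root)]
    while queue:
        _, _, n = heapq.heappop(queue)
        if not n.children:
            selected.append(n)  # leaf: cannot be refined further
        elif len(selected) + len(queue) + len(n.children) > m:
            selected.append(n)  # splitting n would exceed the budget
        else:
            for c in n.children:
                heapq.heappush(queue, (-c.error(), next(tie), c))
    return selected


def leaves(prefix, sizes):
    return [Node(f"{prefix}{i + 1}", s) for i, s in enumerate(sizes)]


# Hypothetical two-level tree: only quadrant A is heterogeneous.
root = Node("root", 2.0, [
    Node("A", 2.0, leaves("a", [4.0, 0.0, 4.0, 0.0])),
    Node("B", 1.0, leaves("b", [1.0] * 4)),
    Node("C", 1.0, leaves("c", [1.0] * 4)),
    Node("D", 1.0, leaves("d", [1.0] * 4)),
])
grids = hss_greedy(root, 7)
print(sorted(g.name for g in grids))  # prints ['B', 'C', 'D', 'a1', 'a2', 'a3', 'a4']
```

With a budget of seven grids, the sketch refines only the quadrant whose children disagree with it and keeps the three homogeneous quadrants coarse.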

Figure 10 provides an example to show how algorithm HSS-Greedy generates hierarchical hybrid signatures for regions $\{R_1, R_2, R_5\}$ of token $t_1$. The algorithm first enqueues the root grid, i.e., the space $\mathcal{R}$. Then, the algorithm repeatedly dequeues the front element, and enqueues its child nodes. For example, since Error($g_1$) = 0.1 > Error($g_2$) = 0.09, the algorithm dequeues grid $g_1$ and enqueues its child nodes. Thus, given $m_t = 8$, we obtain the hierarchical hybrid signatures represented as bold circles in Figure 10.

Algorithm HSS-Greedy (I(t), m_t)
Input: I(t): objects containing token t; m_t: A number
Output: G_t: The selected grids
begin
    Construct a grid tree GT for objects in I(t) ;
    Initialize a priority queue Q ;
    Q.Enqueue(GT.root, Error(GT.root)) ;
    while Q is not empty do
        n ← Q.Dequeue() ;
        if n is a leaf node then G_t ← G_t ∪ {n} ;
        else
            N_c ← Child nodes of n ;
            if |G_t| + |Q| + |N_c| − 1 > m_t then G_t ← G_t ∪ {n} ;
            else for each child node n_c in N_c do
                Q.Enqueue(n_c, Error(n_c)) ;
end

Figure 11: Greedy Algorithm for the HSS problem.

Hierarchical signature-based filtering algorithm. Using the generated hierarchical hybrid signatures, we can improve the performance of algorithm Hybrid-Sig-Filter+ (in Figure 8) as follows. For each token t, we first fix a
global order of its hierarchical grids. We sort the grids in 
ascending order of their levels in the grid tree. For grids in 
the same level, we sort them in ascending order of the num- 
ber of object regions intersecting with them. For example, 
based on the global order, the hierarchical hybrid signature 
of token ti can be sorted as {gl, gl, gl, gl, gl, gl, gl} ■ Then, 
we can employ algorithm Hybrid-Sig-Filter+ to filter dissimilar objects, as illustrated in the following example.

Example 5. Figure 10 provides an example of algorithm Hybrid-Sig-Filter+ using hierarchical hybrid signatures. Consider token $t_1$ in textual signature prefix $S_T^{c_T}(q)$. We can generate three hierarchical grids for the query and compute the weight $w(g \mid q)$ for each grid. Using threshold-aware pruning, we can prune the inverted lists of two of these grids, and only need to probe the inverted list of the remaining hybrid signature of $t_1$. Similarly, we probe the inverted lists of the other tokens in $S_T^{c_T}(q)$ and obtain the candidate set $C = \{o_1, o_2\}$. Notice that this candidate set is more compact than the one in Example 4.

6. EXPERIMENTS 

In this section, we report experimental results. We ex- 
tended the state-of-the-art spatial keyword search method 
IR-tree [7] to support spatio-textual similarity search as 
mentioned in Section 2.3. We compared our Seal method 
with this method. 

6.1 Experiment Setup 

We used two datasets. The first one was a real dataset, Twitter. We collected 60 million tweets from May 2011 to August 2011, among which about 13 million tweets had locations (i.e., points with longitudes and latitudes). We randomly selected 1 million users, and obtained ROI objects from their profiles. On the one hand, we selected the frequent words in each user's tweets as the user's token set, and the average number of tokens per object was 14.3. On the other hand, we took the MBR of all of a user's tweets as the user's active region, and the average area of the regions was 115 square kilometers (sq.km.). Specifically, the distribution of the region sizes was: 0.0001 sq.km. (4.4%), 0.01 sq.km. (15.4%), 1 sq.km. (29.7%), 100 sq.km. (73%), etc., where "x sq.km. (y%)" means that y% of the regions have areas not greater than x sq.km. We can see that the regions had various sizes due to the different spatial distributions of users' tweets, and most region sizes were not very large. Note that more complicated methods can be used to obtain the active regions from users' tweets. For example, we can compute multiple active regions for each user by clustering the locations of the tweets. We leave this as future work, and only focus on the general spatio-textual similarity search problem in this paper. In addition, we generated two query sets to evaluate the performance of different algorithms. Each query set contained 100 queries.

Large-Region Queries: We generated a set of queries with large regions. The average query token number was 6.97, and the average area of the query regions was 554 sq.km., which was equivalent to the area of a district.

Small-Region Queries: We generated a set of queries with small regions. The average query token number was 12.9, and the average area of the query regions was 0.44 sq.km., which was equivalent to the area of a small neighborhood.

We also used a synthetic dataset generated by combining POIs in the USA with the publication records in DBLP. We selected 1 million POIs from the USA dataset as centers and extended the POIs with random widths and heights to generate regions. Then, we randomly distributed publication records to the regions to generate token sets. The average area of the regions was 5.4 sq.km. and the average token number was 12.5. We also generated 100 large-region queries and 100 small-region queries for this dataset. Note that the experimental results on the two datasets were similar. Due to space constraints, we only provide the experimental results of the method comparison on the USA dataset in Section 6.5.

The IR-tree index was disk-resident, and its page size was 4KB. The inverted indexes of our signature-based methods were also disk-resident, and we maintained in memory an index that mapped each signature element to the disk offset of its inverted list. Note that this index was small enough to be kept in memory. For example, for the Twitter dataset, the index only occupied 19 MB. Table 1 summarizes the data statistics and index sizes. For simplicity, we use TokenInv, GridInv, HashInv and HierarchicalInv to denote the inverted indexes of textual, grid-based, hash-based hybrid and hierarchical hybrid signatures, respectively. For GridInv and HashInv we also give the granularity. For example, GridInv (1024) represents the GridInv index with granularity 1024 × 1024. Due to space constraints, we only show the sizes of the indexes used in the method comparison in Section 6.5. Besides, we varied both the spatial and the textual similarity thresholds from 0.1 to 0.5, and the default value of the thresholds was 0.4.

In this paper, we only report the running time; the candidate numbers of the different methods are given in our technical report [8].

All the programs were implemented in Java, and all the experiments were run on an Ubuntu machine with an Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory.



6.2 Token Filter vs. Grid Filter 

We first evaluated algorithm Sig-Filter+ in Figure 6 using textual signatures and grid-based signatures, which are respectively denoted by TokenFilter and GridFilter. We examined GridFilter using grid-based signatures with different granularities. Figure 12 provides the experimental results. We can see that the performance of TokenFilter and GridFilter depended on the similarity thresholds $\tau_R$ and $\tau_T$. As observed from Figure 12(a), TokenFilter outperformed GridFilter for small $\tau_R$. However, with the increase of $\tau_R$, GridFilter became faster. For example, given $\tau_R = 0.5$, GridFilter (1024) took 30 milliseconds, while TokenFilter took 64 milliseconds. Similarly, with the increase of $\tau_T$, TokenFilter became better (see Figures 12(b) and 12(d)). Therefore, it is better to combine both filters instead of using either one individually.

6.3 Evaluating Different Grid Granularities 

We then evaluated the performance of GridFilter with different grid granularities. We partitioned the entire space into p × p uniform grids as mentioned in Section 4, where p denotes the granularity. Then, we ran GridFilter based on different granularities and compared the filtering time and verification time. Figure 13 shows the experimental results. We can see that with the increase of granularity, the verification time always decreased. For example, the verification time decreased from 466 milliseconds at granularity 64 to 90 milliseconds at granularity 8192 in Figure 13(b). Moreover, the decrease of verification cost became more and more insignificant as the granularity increased. The filtering step did not always improve with the increase of granularity, which was consistent with our cost-based analysis in Section 4.3. For example, in Figure 13(a), the filtering time first decreased and then increased after granularity 1024.

(a) Large-Region Queries. (b) Small-Region Queries.
Figure 13: Evaluation on grid granularity selection on the Twitter data set.

6.4 Evaluating Hybrid Filtering Algorithms 

We evaluated the hybrid filtering algorithms presented in Section 5. We first compared the algorithm using hash-based hybrid signatures (HybridFilter) with GridFilter. Figure 14 shows the experimental results, where G and H respectively stand for GridFilter and HybridFilter. We can see that HybridFilter significantly outperformed GridFilter for both large-region and small-region queries. For example, in Figure 14(a), HybridFilter with different granularities (i.e., 256, 512, 1024) was one order of magnitude faster than GridFilter with the same granularity. This is because HybridFilter utilized both spatial and textual pruning simultaneously and thus had larger filtering power than GridFilter with only spatial pruning.

(a) Large-Region Query. (b) Small-Region Query.
Figure 15: Comparison of hybrid signatures on the Twitter data set ($\tau_R = 0.4$, $\tau_T = 0.1$).

Then, we evaluated the effect of hierarchical hybrid signatures, which were judiciously selected to improve the performance of hybrid filtering. We compared filtering algorithms with hash-based hybrid signatures and hierarchical hybrid signatures given different constraints on index size (i.e., maximum numbers of signature elements as mentioned in Section 5.2). Figure 15 shows the experimental results. Hierarchical hybrid signatures achieved better performance than hash-based hybrid signatures for the various index sizes. For example, in Figure 15(b), the elapsed time of the algorithm with hierarchical hybrid signatures was 30 milliseconds given index size 280 MB, while that of hash-based hybrid signatures was 70 milliseconds. The main reason is that we judiciously selected hybrid signatures with hierarchical grids. These signatures improved the filtering power, and thus pruned a large number of dissimilar objects.

6.5 Comparison with Existing Methods 

We compared our algorithm using hierarchical hybrid signatures (denoted by Seal) with the keyword-first method (Keyword), the spatial-first method (Spatial), and the state-of-the-art spatial keyword search method IR-tree [7], as discussed in Section 2.3. Figures 16 and 17 respectively show the experimental results on the Twitter and USA datasets.

We can see that Keyword and Spatial could not effectively prune dissimilar objects. Specifically, Keyword sometimes performed worse than Spatial (see Figure 17(a)), as it did not have spatial pruning power. On the other hand, for large textual thresholds, Spatial might achieve lower efficiency (see Figures 16(d) and 17(d)), as it did not have textual pruning power. In addition, IR-tree also achieved low performance, and it was even worse than Spatial (see Figure 16). This is because the method, which is designed for spatial keyword search, visited too many unnecessary nodes and involved a huge number of dissimilar objects, as mentioned in Section 2.3. In addition, IR-tree had to probe the inverted file associated with each R-tree node, resulting in a large overhead.

Our method Seal always achieved the highest perfor- 
mance for any type of queries and any threshold on the 
two datasets. In Figures 16 and 17, our method was several 
tens of times faster than the baseline methods. For exam- 
ple, in Figure 16(c), given spatial threshold 0.1 and textual 
threshold 0.4, our method took 5 milliseconds, while IR- 
tree, Keyword and Spatial respectively took 253, 52 and 
182 milliseconds. The better performance of our method is 
attributed to the signature-based methods integrating both 
spatial and textual pruning simultaneously, and the hierar- 
chical hybrid signatures we selected for large filtering power. 

6.6 Scalability 

We evaluated the scalability of our hybrid filtering algorithm by varying the number of objects. Figure 18 shows the results. We can see that our method scaled very well: as the number of objects increased, the elapsed time increased sub-linearly. This is because our algorithm could prune a huge number of dissimilar objects even as the number of objects increased.

Figure 18: Scalability on the Twitter data set.

7. CONCLUSION AND FUTURE WORK 

In this paper, we have studied a new research problem 
called spatio-textual similarity search. We introduced a 
filter-and-verification framework to compute the answers. 
We devised efficient signature-based filtering algorithms and 
developed effective pruning techniques. For spatial pruning, 
we proposed grid-based signatures by decomposing the un- 
derlying space, and developed threshold-aware pruning tech- 
niques. To utilize spatial and textual pruning simultaneous- 
ly, we judiciously selected hierarchical hybrid signatures and 
devised hybrid filtering algorithms. We have implemented 
our method and examined it on real and synthetic datasets. 
Experimental results show that our method achieves very 
high search performance. 



Figure 14: Comparison of grid-based and hybrid filters on the Twitter data set. Panels: (a) Large-Region Query, varying the spatial similarity threshold (0.1 to 0.5); (b) Large-Region Query, varying the textual similarity threshold; (c) Small-Region Query, varying the spatial similarity threshold; (d) Small-Region Query, varying the textual similarity threshold. Series: G-256, H-256, G-512, H-512, G-1024, H-1024.

Figure 17: Comparison with existing methods on the USA data set.



We believe this study on spatio-textual similarity search 
opens many new interesting and challenging problems that 
need further research investigation, such as how to extend 
the textual similarity measure to more sophisticated schemes, 
how to provide a theoretical analysis of the approximate solutions presented in this paper, etc.